Neural Network for Digit Classification

Author: Emily Cheng-hsin Wuu

Maximum possible score = 100 points

Assignment Overview

Deep learning has quickly become one of the more popular applied machine learning techniques in computer vision. Convolutional neural networks have been applied to many different computer vision problems such as image classification, recognition, and segmentation with great success.

In this assignment, you will first implement a fully-connected feed-forward neural network for hand-written digit classification, which involves going into the details of network parameter initialization, activation functions, loss functions, and gradient computation. Working through the forward pass and the backward propagation will demystify neural networks. Next, you will move your work to PyTorch, where you can learn to use a modern deep learning framework to classify digits.

Learning Objectives

Source

This assignment was derived, in part, from assignments used in conjunction with computer vision courses at Carnegie Mellon University:

Table of Contents

Part I: Implement a Fully-connected Neural Network [100 pts]

Part 2: PyTorch for Digit Classification [0 pts]

Part I: Implement a Fully-connected Neural Network [100 pts]

We will give a brief overview of the math for a feed-forward network with a single hidden layer. There are two parts: (1) forward propagation, where the network computes predictions from the input data; (2) backward propagation, where the network updates its parameters based on the loss function. This will help you to understand the implementation of the different functions required for building a simple two-layer neural network at the end of this section.

Forward propagation

To do classification with a fully-connected network $f$, you apply a series of linear and non-linear functions to an input vector $x$ of size $N \times 1$ to generate an output vector $f(x)$ of size $C \times 1$, where each element of the output vector represents the probability that the input $x$ belongs to a certain class in the dataset. The input layer has $N$ units because the dimensionality of the data samples is $N$. In between the input and output layers, we need to compute the values of the units in each hidden layer. The pre-activation values of the first hidden layer can be written as

$a^{(1)}(x) = W^{(1)}x + b^{(1)}$

Next, these values go through an activation function $g$, and the post-activation values of the hidden layer $h^{(1)}(x)$ will be:

$h^{(1)}(x)= g(a^{(1)}(x)) = g(W^{(1)}x + b^{(1)})$

Following this pattern, we can write the pre- and post-activation values of any hidden layer ($1 < t \leq T$) as:

$a^{(t)}(x) = W^{(t)}h^{(t-1)}(x) + b^{(t)}$
$h^{(t)}(x) = g(a^{(t)}(x))$

Finally, the post-activation values of the output layer are computed with:

$f(x) = o(a^{(T)}(x)) = o(W^{(T)}h^{(T - 1)}(x) + b^{(T)})$

where $o$ is the output activation function. In most cases, the activation functions $g$ and $o$ will be different. For this assignment, you will use the sigmoid activation function for the hidden layer (activation function $g$):

$g(y) = \frac{1}{1 + \exp(-y)}$

where, when $g$ is applied to a vector, it is applied element-wise across the vector.

Since you are working on a classification task, you will use the softmax function as the output activation function ($o$) to turn the real-valued, possibly negative entries of $a^{(T)}(x)$ into a set of probabilities (a vector of positive numbers that sum to 1), where each element represents the probability of the input belonging to a certain class in the dataset. Letting $x_{i}$ denote the $i$th element of the vector $x$, the softmax function is defined as:

$o_{i}(y) = \frac{\exp(y_{i})}{\sum_{j}{\exp(y_{j})}}$

After your network outputs the predicted probabilities for each class, you arrive at the final destination of the forward propagation: computing the loss function. Here, the loss function measures the difference between the ground truth (true value) and the prediction from the network. A properly-functioning network should produce predictions as close to the ground truth as possible, meaning the difference between these two terms should be close to zero. For this programming assignment, you will use cross-entropy loss, which is generally used for classification tasks:

$L_{f}(D) = -\sum_{(x, y)\in D}{y \cdot \log(f(x))}$

In the later sections, you will see more details about the intuitive explanations for these different functions.

Backward propagation

After each forward propagation, you need a method to update the network parameters based on the loss function. Here, gradient descent is an iterative optimization algorithm used to find a local optimum. To find it, we start from a randomly selected point on the function and move in the direction of the negative gradient until some stopping criterion is met (e.g., the loss doesn't change for a long time). To compute the gradient, you need the derivative of the loss function with respect to the network parameters (weights and biases). Mathematically, the update equations for a general weight $W_{ij}^{(t)}$ and bias $b_{i}^{(t)}$ are:

$W_{ij}^{(t)} = W_{ij}^{(t-1)} - \alpha \frac{\partial L_{f}}{\partial W_{ij}^{(t-1)}}$
$b_{i}^{(t)} = b_{i}^{(t-1)} - \alpha \frac{\partial L_{f}}{\partial b_{i}^{(t-1)}}$

where $\alpha$ is the learning rate.

Let's Get Started!

After knowing the math behind building the neural network, you will now start to put everything together, which includes initialization, the activation function, the loss function for forward pass, and gradient descent for backward propagation. At the end of Part 1, you will build a simple two-layer neural network from scratch and train your network to do digit classification.

Network Initialization [ 0 pts]

Initialization of network weights is one of the more important techniques for training a network successfully. You might wonder why you couldn't just initialize all the weights to zero. The reason is that initializing all the parameters (weights and biases) to zero makes the output of every neuron zero during the forward propagation, so during back-propagation you get the same gradient for all the parameters, since they all receive the same input. Hence, it would be impossible to update the parameters and move closer and closer to the optimal solution.

To properly handle initialization, you will use Xavier initialization (formula (16) in the original paper), where $Var[weight] = \frac{2}{n_{in} + n_{out}}$, with $n_{in}$ the dimensionality of the input layer and $n_{out}$ the dimensionality of the output layer. You will sample random numbers from a uniform distribution $weight \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}, \frac{\sqrt{6}}{\sqrt{n_{in} + n_{out}}}\right]$. The function is provided as follows.
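For reference, one way such an initializer can be written in NumPy (the function and argument names here are illustrative, not necessarily those of the provided code):

```python
import numpy as np

def initialize_weights(in_size, out_size):
    """Xavier initialization: W ~ U[-b, b] with b = sqrt(6 / (n_in + n_out)).

    Biases start at zero; only the weights need careful scaling.
    """
    bound = np.sqrt(6.0 / (in_size + out_size))
    W = np.random.uniform(-bound, bound, size=(in_size, out_size))
    b = np.zeros(out_size)
    return W, b

# Example: weights mapping a 1024-dim input (32x32 image) to 64 hidden units
W, b = initialize_weights(1024, 64)
```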

Sigmoid Activation Function [ 10 pts]

Now you will move on to implementing the forward propagation for your neural network, starting with the activation functions. Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network and determines whether it should be activated or not, based on whether each neuron's output is relevant for the model's prediction.

You want to implement the sigmoid activation function, along with forward propagation for a single layer with an activation function, namely $y = \sigma(XW + b)$, returning the output and intermediate results for an $N \times D$ input $X$, with examples (the $N$ dimension) along the rows and data dimensions (the $D$ dimension) along the columns. For example, if $X$ consists of 10 points in 2D coordinates $(x, y)$, then $X$ will have dimension 10 x 2 ($N=10$, $D=2$). If you visualize the sigmoid activation function, it will look like the figure below. Notice the range of the output is clamped between zero and one.

Follow the math formula of sigmoid activation:

$\sigma(x) = \frac{1}{1 + \exp(-x)}$

to implement the following function with NumPy. You can call `np.exp(x)` to compute the exponential of $x$.
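A minimal sketch of what the element-wise sigmoid looks like in NumPy (the function name is illustrative):

```python
import numpy as np

def sigmoid(x):
    # Element-wise sigmoid: maps any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))
```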

Softmax Function [ 15 pts]

Since we are using neural networks for classification, a common output activation function to use for the last layer is the softmax function. This will allow us to turn the real value, possibly negative values into a set of probabilities (vector of positive numbers that sum to 1). Implement the softmax function, which is defined as below for each index $i$ in a vector $x$:

$\text{softmax}(x_{i}) = \frac{exp(x_{i})}{\sum_{j}{exp(x_{j})}} $

Here, we usually add an offset $c = -\max_i(x_{i})$ to the input $x_i$ to make sure the numerator stays within the range $(0, 1]$. If we don't apply any offset (i.e., $c=0$ in this formula), the numerator can range over $(0, \infty)$, which might cause a numerical overflow. This trick exploits the fact that the softmax function is invariant to translation, that is:

$\text{softmax}(x) = \text{softmax}(x + c), \forall c \in \mathbb{R} $
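A sketch of the softmax with the max-subtraction trick described above, assuming a 1-D input vector (the function name is illustrative):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is unchanged
    # because softmax is invariant to adding a constant to every entry.
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)
```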

Loss Function (Cross-Entropy)[ 15 pts]

In this last part of the forward pass, you want to evaluate the prediction of the network with its ground truth labels. The loss function generally used for classification is the cross-entropy loss, which is defined as the following:

$L_{f}(D) = -\sum_{(x, y)\in D}{y \cdot \log(f(x))}$

where $D$ is the full training dataset of data samples $x$ ($N \times 1$ vectors, $N =$ dimensionality of data) and labels $y$ ($C \times 1$ one-hot vectors, $C =$ number of classes).

Also, you'll need to know the accuracy of the current performance, which is defined as the number of correct predictions over all predictions. That is, it's the proportion of correct predictions.
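One way this could look in NumPy, assuming one-hot labels and predicted probabilities arranged with examples along the rows (the function and argument names are illustrative):

```python
import numpy as np

def compute_loss_and_acc(y, probs):
    # y: (examples, classes) one-hot labels
    # probs: (examples, classes) predicted probabilities from softmax
    loss = -np.sum(y * np.log(probs))                 # cross-entropy over the batch
    preds = np.argmax(probs, axis=1)                  # predicted class per example
    acc = np.mean(preds == np.argmax(y, axis=1))      # fraction of correct predictions
    return loss, acc
```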

Forward Propagation[ 20 pts]

You now have all the components needed for computing the forward pass, which should be as follows:

  1. Input ($X$) $\rightarrow$
  2. Linear Layer* ($Y = XW + B$) $\rightarrow$
  3. Activation Layer* ($Y = \sigma (XW + b)$) $\rightarrow$
  4. Loop through Linear Layer (step 2) then Activation layer (step 3) $\rightarrow$
  5. Output Layer (softmax) $\rightarrow$
  6. Compute Loss and Accuracy

*Here, you'll want to modularize the "Linear Layer ($Y = XW + B$) $\rightarrow$ Activation Layer ($Y = \sigma(XW + b)$)" part so you can reuse it.

Please note the use of the name argument which is used for clarification to label the layers.
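The steps above, specialized to a two-layer network, could be sketched as follows (the weight shapes and function names are illustrative, not the assignment's required interface):

```python
import numpy as np

def sigmoid(x):
    # Element-wise sigmoid activation (step 3)
    return 1.0 / (1.0 + np.exp(-x))

def forward_two_layer(X, W1, b1, W2, b2):
    # Hidden layer: linear then sigmoid (steps 2-3)
    a1 = X @ W1 + b1
    h1 = sigmoid(a1)
    # Output layer: linear then row-wise softmax (step 5)
    a2 = h1 @ W2 + b2
    shifted = a2 - a2.max(axis=1, keepdims=True)   # stability offset
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)  # one probability row per example
```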

Gradient Computation[ 20 pts]

After finishing the forward pass, we can now look closer at the backward pass, which optimizes the network by updating its parameters (weights and biases). Given a neural network and a loss function, backward propagation calculates the gradient of the loss function with respect to the network's weights. For the linear (identity) activation, the derivative is simply 1 regardless of the input, since the identity function is a line with constant slope 1. For a non-linear activation function, however, we need to compute the derivative manually. In this assignment, you only need the derivative of the sigmoid function, which can be derived as seen on this page.

Try to compute it again yourself, and implement the function below according to the formula.

First derivative of a linear function [10pts]

First derivative of a sigmoid function [10pts]
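A sketch of the sigmoid derivative, expressed in terms of the stored post-activation values as is common in backprop implementations (the function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(post_act):
    # If s = sigmoid(a), then ds/da = s * (1 - s).
    # Written in terms of the post-activation values saved in the forward pass,
    # so sigmoid never needs to be recomputed during backprop.
    return post_act * (1.0 - post_act)
```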

Backward Propagation[ 0 pts]

You now have all the components necessary for computing the backward pass. You can apply the chain rule to backpropagate the gradient from the output to the input.

Combining the gradient computation from before along with the update equation for a general weight $W_{ij}^{t}$ and bias $b_{i}^{t}$ yields:

$W_{ij}^{(t)} = W_{ij}^{(t-1)} - \alpha \frac{\partial L_{f}}{\partial W_{ij}^{(t-1)}}$
$b_{i}^{(t)} = b_{i}^{(t-1)} - \alpha \frac{\partial L_{f}}{\partial b_{i}^{(t-1)}}$
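The update equations above, written out in NumPy with placeholder gradients (in the real training loop, `grad_W` and `grad_b` would come from your backprop code; the values here are toy stand-ins):

```python
import numpy as np

alpha = 0.01                           # learning rate
W = np.ones((3, 2)); b = np.zeros(2)   # toy parameters
grad_W = 0.5 * np.ones((3, 2))         # placeholder for dL/dW from backprop
grad_b = 0.1 * np.ones(2)              # placeholder for dL/db from backprop

W = W - alpha * grad_W                 # W^(t) = W^(t-1) - alpha * dL/dW
b = b - alpha * grad_b                 # b^(t) = b^(t-1) - alpha * dL/db
```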

Train Models with NIST36 [ 20 pts]

You'll now test your model on the NIST36 dataset for hand-written digit classification. You'll need a training loop and a testing loop to put the forward pass and the backprop together for your two-layer network.

The dimensions of each input image are 32 x 32 pixels. The image values are organized as a single 1024-dimensional vector, which is multiplied by $W^{(1)}$. Because of this, each row of $W^{(1)}$ is a weight vector over the input pixels. The rows of $W^{(1)}$ can be reshaped into 32 x 32 images to suggest which types of images each hidden unit is sensitive to.
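A sketch of the reshaping described above, assuming the weights are stored with one row per hidden unit (your own layout may be transposed, in which case reshape the columns instead):

```python
import numpy as np

# Hypothetical first-layer weights with 64 hidden units: one row per hidden
# unit, 1024 weights covering the 32x32 input pixels.
W1 = np.random.randn(64, 1024)
imgs = W1.reshape(64, 32, 32)   # one 32x32 "filter image" per hidden unit
```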

You are provided with three data .mat files to use for this section.

The data in `nist36_train.mat` represent samples for each of the 26 upper-case letters of the alphabet and the 10 digits. You can use this set to train your network.

The validation set in `nist36_valid.mat` contains samples from each class and should be used in the training loop to check how well the network performs on data that is not in the training set. This will help to identify overfitting.

The data in `nist36_test.mat` should be used for the final evaluation on your best model to see how well it will generalize to new unseen data.

The following sections will go step-by-step describing how to implement the training and testing loops. At the end of this question, you will be able to visualize what your network learned.

First, load your data.

Next, initialize your hyperparameters.

Then, split your data into batches.

Now, initialize your network layers [5pts].

In this cell, run your training loop [10pts]. This might take a little while to run. You should see both training and validation data for each of the max_iters you set up earlier. With max_iters=50, you'll see 100 rows of data from this cell. Note: it's recommended to use the same name argument you used in the earlier "initialize weights" part or you may get an error.

You should visualize that data you just produced to see the trend for the training and validation accuracy and loss. Think about how these curves should look and then check your results in the plots.

Now take a look at the accuracy [5pts]. This should be a relatively high number if your model is working well!

Visualize the network

Now you'll see what your network learned when it passes through different layers. Since your network is a simple two-layer network with only one hidden layer and an output layer, you can visualize the weights of network (1) right after initialization (2) from the hidden layer (3) from the output layer. Run the code below and try to describe what you discover! Does this visualization look organized to you?

Do another visualization using the output of the first layer. Does this look different to you? How so?

Now visualize the results of the output layer and compare the resulting plots to the ground truth images. How do these images look to you?

Confusion Matrix

A confusion matrix is a technique for summarizing the performance of a classification algorithm. The number of correct and incorrect predictions is summarized with count values, broken down by class. This is the key to the confusion matrix: it shows the ways in which your classification model is confused when it makes predictions, and can tell you where the model frequently makes mistakes.

Run the following code and point out a few pairs of classes that are most commonly confused. These will appear as highlighted regions or spots in the triangles of the plot.
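A minimal sketch of how such a confusion matrix can be accumulated (the provided plotting code may build it differently; the function name is illustrative):

```python
import numpy as np

def confusion_matrix(true_idx, pred_idx, num_classes):
    # cm[i, j] counts samples whose true class is i and predicted class is j;
    # correct predictions accumulate on the diagonal, confusions off it.
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(true_idx, pred_idx):
        cm[t, p] += 1
    return cm
```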

Part 2: PyTorch for Digit Classification [ 0 pts]

While you were able to derive manual back-propagation rules for sigmoid and fully-connected layers, wouldn’t it be nice if someone did that for lots of useful primitives and made it fast and easy to use for general computation? Meet automatic differentiation.

Since you have high-dimensional inputs (images) and low-dimensional outputs (a scalar loss), it turns out reverse-mode AD is very efficient. Popular AD packages include PyTorch (Facebook) and TensorFlow (Google). Autograd provides a drop-in replacement for NumPy operators, except you can ask for gradients now. The other two are able to act as shim layers for cuDNN, a library of GPU-accelerated deep learning primitives made by Nvidia.

Since GPUs are able to perform large amounts of math much faster than CPUs, this makes the former two packages very popular for researchers who train large networks. Tensorflow asks you to build a computational graph using its API and then is able to pass data through that graph. PyTorch builds a dynamic graph and allows you to mix autograd functions with normal Python code much more smoothly, so it is currently more popular among people who want to build their own neural network for computer vision tasks.

You will now use PyTorch as a framework. Many computer vision projects use neural networks as a basic building block, so familiarity with one of these frameworks is a good skill to develop. Here, you basically replicate and slightly expand your handwritten character recognition networks, but do it in PyTorch instead of from scratch. For this section, you will rewrite and retrain your fully-connected network on NIST36 in PyTorch and then plot training accuracy and loss over time.

Although you are given the PyTorch code directly, you are encouraged to check PyTorch's official tutorial and try to train a convolutional neural network with PyTorch on different datasets such as MNIST or EMNIST.

PyTorch Installation[ 0 pts]

Follow the official guideline to install PyTorch and then import the necessary libraries on the following section.

Train Models with NIST36 (PyTorch) [ 0 pts]

We already provide you with the PyTorch version of training on NIST36. Run the code and plot the accuracy and loss of the training and validation sets. Make sure you understand the training loop and are able to relate each functional portion of the NumPy version to the PyTorch version. This chunk of code will be extremely helpful if you want to utilize a neural network for classification tasks in your capstone project!